#### CS 61C: Great Ideas in Computer Architecture

Lecture 12: Control & Operating Speed

John Wawrzynek & Nick Weaver <a href="http://inst.eecs.berkeley.edu/~cs61c/sp18">http://inst.eecs.berkeley.edu/~cs61c/sp18</a>

#### Agenda

- Finish Single-Cycle RISC-V Datapath
- Controller
- Instruction Timing
- Performance Measures
- Introduction to Pipelining
- Pipelined RISC-V Datapath
- And in Conclusion, ...

# Recap: Adding branches to datapath



#### Implementing **JALR** Instruction (I-Format)

| inst[31:20]  | inst[19:15] | inst[14:12] | inst[11:7] | inst[6:0] |
|--------------|-------------|-------------|------------|-----------|
| imm[11:0]    | rs1         | funct3      | rd         | opcode    |
| 12           | 5           | 3           | 5          | 7         |
| offset[11:0] | base        | 0           | dest       | JALR      |

- JALR rd, rs1, immediate
  - Writes PC+4 to Reg[rd] (return address)
  - Sets PC = Reg[rs1] + immediate
  - Uses same immediates as arithmetic and loads
    - the immediate is a byte offset: no multiplication by 2 bytes (unlike branch and jal offsets)
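
The JALR behavior listed above can be sketched in a few lines of Python. This is a minimal model, not the course's reference implementation: the function names and the list-of-registers representation are assumptions made for illustration. One detail goes beyond the slide: the RISC-V specification also clears the least-significant bit of the computed target, which the sketch includes and marks in a comment.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def execute_jalr(pc, regs, rd, rs1, imm12):
    """Return (next_pc, new_regs) after JALR rd, rs1, imm (hypothetical helper)."""
    # Compute the target first, so rd == rs1 still uses the old Reg[rs1].
    target = (regs[rs1] + sign_extend(imm12, 12)) & 0xFFFFFFFF
    target &= 0xFFFFFFFE              # per the RISC-V spec: clear bit 0 (not on the slide)
    new_regs = list(regs)
    if rd != 0:                       # x0 is hard-wired to zero
        new_regs[rd] = (pc + 4) & 0xFFFFFFFF   # return address
    return target, new_regs


regs = [0] * 32
regs[1] = 0x1000
next_pc, regs_after = execute_jalr(0x2000, regs, rd=5, rs1=1, imm12=0xFFC)  # imm = -4
print(hex(next_pc), hex(regs_after[5]))
```

Note that the immediate is added as a plain byte offset, with no shift by one bit as in branches.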

#### Datapath with Branches



## Adding jalr to datapath




## Implementing jal Instruction



- JAL saves PC+4 in Reg[rd] (the return address)
- Set PC = PC + offset (PC-relative jump)
- Target somewhere within ±2<sup>19</sup> locations, 2 bytes apart
  - ±2<sup>18</sup> 32-bit instructions
- Immediate encoding optimized similarly to branch instruction to reduce hardware cost
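
To make the scrambled J-type immediate concrete, here is a small Python sketch that packs and unpacks the offset using the bit placement imm[20|10:1|11|19:12] in inst[31:12]. The helper names `encode_jal` and `decode_jal_offset` are hypothetical, chosen only for this example.

```python
def encode_jal(rd, offset):
    """Assemble JAL rd, offset (offset in bytes; must be even, within +-2^20)."""
    assert offset % 2 == 0 and -(1 << 20) <= offset < (1 << 20)
    imm = offset & 0x1FFFFF                   # 21-bit two's complement
    return (((imm >> 20) & 0x1) << 31         # imm[20]    -> inst[31]
            | ((imm >> 1) & 0x3FF) << 21      # imm[10:1]  -> inst[30:21]
            | ((imm >> 11) & 0x1) << 20       # imm[11]    -> inst[20]
            | ((imm >> 12) & 0xFF) << 12      # imm[19:12] -> inst[19:12]
            | (rd & 0x1F) << 7                # rd         -> inst[11:7]
            | 0b1101111)                      # JAL opcode


def decode_jal_offset(inst):
    """Recover the signed byte offset from a JAL instruction word."""
    imm = (((inst >> 31) & 0x1) << 20
           | ((inst >> 21) & 0x3FF) << 1
           | ((inst >> 20) & 0x1) << 11
           | ((inst >> 12) & 0xFF) << 12)
    return imm - (1 << 21) if imm & (1 << 20) else imm


print(hex(encode_jal(1, 8)))            # jal ra, 8
print(decode_jal_offset(0xFFDFF06F))    # a jump back by one instruction
```

Bit 0 of the offset is never stored, which is why the 20 immediate bits reach ±2<sup>19</sup> two-byte locations.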

# Adding jal to datapath




#### Recap: Complete RV32I ISA

Fields wider than one column (e.g. imm[31:12]) are written in their leftmost column.

| Instruction | inst[31:25]              | inst[24:20] | inst[19:15] | inst[14:12] | inst[11:7]   | inst[6:0] |
|-------------|--------------------------|-------------|-------------|-------------|--------------|-----------|
| LUI         | imm[31:12]               |             |             |             | rd           | 0110111   |
| AUIPC       | imm[31:12]               |             |             |             | rd           | 0010111   |
| JAL         | imm[20\|10:1\|11\|19:12] |             |             |             | rd           | 1101111   |
| JALR        | imm[11:0]                |             | rs1         | 000         | rd           | 1100111   |
| BEQ         | imm[12\|10:5]            | rs2         | rs1         | 000         | imm[4:1\|11] | 1100011   |
| BNE         | imm[12\|10:5]            | rs2         | rs1         | 001         | imm[4:1\|11] | 1100011   |
| BLT         | imm[12\|10:5]            | rs2         | rs1         | 100         | imm[4:1\|11] | 1100011   |
| BGE         | imm[12\|10:5]            | rs2         | rs1         | 101         | imm[4:1\|11] | 1100011   |
| BLTU        | imm[12\|10:5]            | rs2         | rs1         | 110         | imm[4:1\|11] | 1100011   |
| BGEU        | imm[12\|10:5]            | rs2         | rs1         | 111         | imm[4:1\|11] | 1100011   |
| LB          | imm[11:0]                |             | rs1         | 000         | rd           | 0000011   |
| LH          | imm[11:0]                |             | rs1         | 001         | rd           | 0000011   |
| LW          | imm[11:0]                |             | rs1         | 010         | rd           | 0000011   |
| LBU         | imm[11:0]                |             | rs1         | 100         | rd           | 0000011   |
| LHU         | imm[11:0]                |             | rs1         | 101         | rd           | 0000011   |
| SB          | imm[11:5]                | rs2         | rs1         | 000         | imm[4:0]     | 0100011   |
| SH          | imm[11:5]                | rs2         | rs1         | 001         | imm[4:0]     | 0100011   |
| SW          | imm[11:5]                | rs2         | rs1         | 010         | imm[4:0]     | 0100011   |
| ADDI        | imm[11:0]                |             | rs1         | 000         | rd           | 0010011   |
| SLTI        | imm[11:0]                |             | rs1         | 010         | rd           | 0010011   |
| SLTIU       | imm[11:0]                |             | rs1         | 011         | rd           | 0010011   |
| XORI        | imm[11:0]                |             | rs1         | 100         | rd           | 0010011   |
| ORI         | imm[11:0]                |             | rs1         | 110         | rd           | 0010011   |
| ANDI        | imm[11:0]                |             | rs1         | 111         | rd           | 0010011   |
| SLLI        | 0000000                  | shamt       | rs1         | 001         | rd           | 0010011   |
| SRLI        | 0000000                  | shamt       | rs1         | 101         | rd           | 0010011   |
| SRAI        | 0100000                  | shamt       | rs1         | 101         | rd           | 0010011   |
| ADD         | 0000000                  | rs2         | rs1         | 000         | rd           | 0110011   |
| SUB         | 0100000                  | rs2         | rs1         | 000         | rd           | 0110011   |
| SLL         | 0000000                  | rs2         | rs1         | 001         | rd           | 0110011   |
| SLT         | 0000000                  | rs2         | rs1         | 010         | rd           | 0110011   |
| SLTU        | 0000000                  | rs2         | rs1         | 011         | rd           | 0110011   |
| XOR         | 0000000                  | rs2         | rs1         | 100         | rd           | 0110011   |
| SRL         | 0000000                  | rs2         | rs1         | 101         | rd           | 0110011   |
| SRA         | 0100000                  | rs2         | rs1         | 101         | rd           | 0110011   |
| OR          | 0000000                  | rs2         | rs1         | 110         | rd           | 0110011   |
| AND         | 0000000                  | rs2         | rs1         | 111         | rd           | 0110011   |
| FENCE       | 0000 pred succ           |             | 00000       | 000         | 00000        | 0001111   |
| FENCE.I     | 0000 0000 0000           |             | 00000       | 001         | 00000        | 0001111   |
| ECALL       | 000000000000             |             | 00000       | 000         | 00000        | 1110011   |
| EBREAK      | 000000000001             |             | 00000       | 000         | 00000        | 1110011   |
| CSRRW       | csr                      |             | rs1         | 001         | rd           | 1110011   |
| CSRRS       | csr                      |             | rs1         | 010         | rd           | 1110011   |
| CSRRC       | csr                      |             | rs1         | 011         | rd           | 1110011   |
| CSRRWI      | csr                      |             | zimm        | 101         | rd           | 1110011   |
| CSRRSI      | csr                      |             | zimm        | 110         | rd           | 1110011   |
| CSRRCI      | csr                      |             | zimm        | 111         | rd           | 1110011   |

RV32I has 47 instructions in total; 37 of them are covered in CS 61C.

Other instructions (e.g. lui, auipc) can be implemented with no new additions to the datapath, only by changing control.

# Single-Cycle RISC-V RV32I Datapath



#### Clicker Question

What are proper control signals for **lui** instruction?

A: BSel=0, ASel=0, WBSel=0

B: BSel=0, ASel=0, WBSel=1

C: BSel=0, ASel=1, WBSel=1

D: BSel=1, ASel=1, WBSel=1



# Implementing lui




#### **Processor**



# Single-Cycle RISC-V RV32I Datapath



#### Control Logic Truth Table (incomplete)

| Inst[31:0] | BrEq | BrLT | PCSel | ImmSel | BrUn | ASel | BSel | ALUSel | MemRW | RegWEn | WBSel |
|------------|------|------|-------|--------|------|------|------|--------|-------|--------|-------|
| add        | *    | *    | +4    | *      | *    | Reg  | Reg  | Add    | Read  | 1      | ALU   |
| sub        | *    | *    | +4    | *      | *    | Reg  | Reg  | Sub    | Read  | 1      | ALU   |
| (R-R Op)   | *    | *    | +4    | *      | *    | Reg  | Reg  | (Op)   | Read  | 1      | ALU   |
| addi       | *    | *    | +4    | I      | *    | Reg  | Imm  | Add    | Read  | 1      | ALU   |
| lw         | *    | *    | +4    | I      | *    | Reg  | Imm  | Add    | Read  | 1      | Mem   |
| sw         | *    | *    | +4    | S      | *    | Reg  | Imm  | Add    | Write | 0      | *     |
| beq        | 0    | *    | +4    | B      | *    | PC   | Imm  | Add    | Read  | 0      | *     |
| beq        | 1    | *    | ALU   | B      | *    | PC   | Imm  | Add    | Read  | 0      | *     |
| bne        | 0    | *    | ALU   | B      | *    | PC   | Imm  | Add    | Read  | 0      | *     |
| bne        | 1    | *    | +4    | B      | *    | PC   | Imm  | Add    | Read  | 0      | *     |
| blt        | *    | 1    | ALU   | B      | 0    | PC   | Imm  | Add    | Read  | 0      | *     |
| bltu       | *    | 1    | ALU   | B      | 1    | PC   | Imm  | Add    | Read  | 0      | *     |
| jalr       | *    | *    | ALU   | I      | *    | Reg  | Imm  | Add    | Read  | 1      | PC+4  |
| jal        | *    | *    | ALU   | J      | *    | PC   | Imm  | Add    | Read  | 1      | PC+4  |
| auipc      | *    | *    | +4    | U      | *    | PC   | Imm  | Add    | Read  | 1      | ALU   |
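
The truth table above can be read as a function from (instruction, BrEq, BrLT) to control signals. The Python sketch below transcribes the listed rows; the function name `control` and the dictionary layout are illustrative assumptions, and the not-taken case for blt/bltu (which the incomplete table does not list) is assumed to select +4.

```python
def control(inst, br_eq=0, br_lt=0):
    """Return the control signals for one instruction ('*' = don't care).

    Only instructions listed in the truth table are handled.
    """
    # (ImmSel, BrUn, ASel, BSel, ALUSel, MemRW, RegWEn, WBSel)
    rows = {
        "add":   ("*", "*", "Reg", "Reg", "Add", "Read",  1, "ALU"),
        "sub":   ("*", "*", "Reg", "Reg", "Sub", "Read",  1, "ALU"),
        "addi":  ("I", "*", "Reg", "Imm", "Add", "Read",  1, "ALU"),
        "lw":    ("I", "*", "Reg", "Imm", "Add", "Read",  1, "Mem"),
        "sw":    ("S", "*", "Reg", "Imm", "Add", "Write", 0, "*"),
        "beq":   ("B", "*", "PC",  "Imm", "Add", "Read",  0, "*"),
        "bne":   ("B", "*", "PC",  "Imm", "Add", "Read",  0, "*"),
        "blt":   ("B", 0,   "PC",  "Imm", "Add", "Read",  0, "*"),
        "bltu":  ("B", 1,   "PC",  "Imm", "Add", "Read",  0, "*"),
        "jalr":  ("I", "*", "Reg", "Imm", "Add", "Read",  1, "PC+4"),
        "jal":   ("J", "*", "PC",  "Imm", "Add", "Read",  1, "PC+4"),
        "auipc": ("U", "*", "PC",  "Imm", "Add", "Read",  1, "ALU"),
    }
    imm_sel, br_un, a_sel, b_sel, alu_sel, mem_rw, reg_w_en, wb_sel = rows[inst]

    # PCSel depends on the comparator outputs for branches; jumps always
    # take the ALU result. Assumption: blt/bltu fall through when BrLT = 0.
    taken = {
        "beq": br_eq == 1, "bne": br_eq == 0,
        "blt": br_lt == 1, "bltu": br_lt == 1,
        "jal": True, "jalr": True,
    }
    pc_sel = "ALU" if taken.get(inst, False) else "+4"

    return {"PCSel": pc_sel, "ImmSel": imm_sel, "BrUn": br_un, "ASel": a_sel,
            "BSel": b_sel, "ALUSel": alu_sel, "MemRW": mem_rw,
            "RegWEn": reg_w_en, "WBSel": wb_sel}


print(control("beq", br_eq=1))
print(control("lw"))
```

Every branch row uses the ALU to compute PC + immediate regardless of the outcome; only PCSel changes with BrEq and BrLT.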

# Instruction type encoded using only 9 bits: inst[30], inst[14:12], inst[6:2]

inst[30] is the second bit of the inst[31:25] column, inst[14:12] is the funct3 column, and inst[6:2] is the upper five bits of the opcode column (inst[1:0] is 11 for every instruction listed). Fields wider than one column are written in their leftmost column.

| inst[31:25]              | inst[24:20] | inst[19:15] | inst[14:12] | inst[11:7]   | inst[6:0] | Instruction |
|--------------------------|-------------|-------------|-------------|--------------|-----------|-------------|
| imm[31:12]               |             |             |             | rd           | 0110111   | LUI         |
| imm[31:12]               |             |             |             | rd           | 0010111   | AUIPC       |
| imm[20\|10:1\|11\|19:12] |             |             |             | rd           | 1101111   | JAL         |
| imm[11:0]                |             | rs1         | 000         | rd           | 1100111   | JALR        |
| imm[12\|10:5]            | rs2         | rs1         | 000         | imm[4:1\|11] | 1100011   | BEQ         |
| imm[12\|10:5]            | rs2         | rs1         | 001         | imm[4:1\|11] | 1100011   | BNE         |
| imm[12\|10:5]            | rs2         | rs1         | 100         | imm[4:1\|11] | 1100011   | BLT         |
| imm[12\|10:5]            | rs2         | rs1         | 101         | imm[4:1\|11] | 1100011   | BGE         |
| imm[12\|10:5]            | rs2         | rs1         | 110         | imm[4:1\|11] | 1100011   | BLTU        |
| imm[12\|10:5]            | rs2         | rs1         | 111         | imm[4:1\|11] | 1100011   | BGEU        |
| imm[11:0]                |             | rs1         | 000         | rd           | 0000011   | LB          |
| imm[11:0]                |             | rs1         | 001         | rd           | 0000011   | LH          |
| imm[11:0]                |             | rs1         | 010         | rd           | 0000011   | LW          |
| imm[11:0]                |             | rs1         | 100         | rd           | 0000011   | LBU         |
| imm[11:0]                |             | rs1         | 101         | rd           | 0000011   | LHU         |
| imm[11:5]                | rs2         | rs1         | 000         | imm[4:0]     | 0100011   | SB          |
| imm[11:5]                | rs2         | rs1         | 001         | imm[4:0]     | 0100011   | SH          |
| imm[11:5]                | rs2         | rs1         | 010         | imm[4:0]     | 0100011   | SW          |
| imm[11:0]                |             | rs1         | 000         | rd           | 0010011   | ADDI        |
| imm[11:0]                |             | rs1         | 010         | rd           | 0010011   | SLTI        |
| imm[11:0]                |             | rs1         | 011         | rd           | 0010011   | SLTIU       |
| imm[11:0]                |             | rs1         | 100         | rd           | 0010011   | XORI        |
| imm[11:0]                |             | rs1         | 110         | rd           | 0010011   | ORI         |
| imm[11:0]                |             | rs1         | 111         | rd           | 0010011   | ANDI        |
| 0000000                  | shamt       | rs1         | 001         | rd           | 0010011   | SLLI        |
| 0000000                  | shamt       | rs1         | 101         | rd           | 0010011   | SRLI        |
| 0100000                  | shamt       | rs1         | 101         | rd           | 0010011   | SRAI        |
| 0000000                  | rs2         | rs1         | 000         | rd           | 0110011   | ADD         |
| 0100000                  | rs2         | rs1         | 000         | rd           | 0110011   | SUB         |
| 0000000                  | rs2         | rs1         | 001         | rd           | 0110011   | SLL         |
| 0000000                  | rs2         | rs1         | 010         | rd           | 0110011   | SLT         |
| 0000000                  | rs2         | rs1         | 011         | rd           | 0110011   | SLTU        |
| 0000000                  | rs2         | rs1         | 100         | rd           | 0110011   | XOR         |
| 0000000                  | rs2         | rs1         | 101         | rd           | 0110011   | SRL         |
| 0100000                  | rs2         | rs1         | 101         | rd           | 0110011   | SRA         |
| 0000000                  | rs2         | rs1         | 110         | rd           | 0110011   | OR          |
| 0000000                  | rs2         | rs1         | 111         | rd           | 0110011   | AND         |
| 0000 pred succ           |             | 00000       | 000         | 00000        | 0001111   | FENCE       |
| 0000 0000 0000           |             | 00000       | 001         | 00000        | 0001111   | FENCE.I     |
| 000000000000             |             | 00000       | 000         | 00000        | 1110011   | ECALL       |
| 000000000001             |             | 00000       | 000         | 00000        | 1110011   | EBREAK      |
| csr                      |             | rs1         | 001         | rd           | 1110011   | CSRRW       |
| csr                      |             | rs1         | 010         | rd           | 1110011   | CSRRS       |
| csr                      |             | rs1         | 011         | rd           | 1110011   | CSRRC       |
| csr                      |             | zimm        | 101         | rd           | 1110011   | CSRRWI      |
| csr                      |             | zimm        | 110         | rd           | 1110011   | CSRRSI      |
| csr                      |             | zimm        | 111         | rd           | 1110011   | CSRRCI      |
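
As a rough illustration of how the nine bits can be pulled out of a 32-bit instruction word, consider the sketch below. The function name `control_key` and the order in which the bits are packed into one integer are assumptions made for this example; the example words are hand-assembled from the field layout in the table.

```python
def control_key(inst):
    """Pack inst[30], inst[14:12], inst[6:2] into one 9-bit integer.

    Packing order (an arbitrary choice for this sketch):
    bit 8 = inst[30], bits 7..5 = inst[14:12], bits 4..0 = inst[6:2].
    """
    bit30 = (inst >> 30) & 0x1
    funct3 = (inst >> 12) & 0x7
    opcode_hi = (inst >> 2) & 0x1F      # inst[1:0] carries no information here
    return (bit30 << 8) | (funct3 << 5) | opcode_hi


ADD_X1_X2_X3 = 0x003100B3    # add x1, x2, x3
SUB_X1_X2_X3 = 0x403100B3    # sub x1, x2, x3 (differs from add only in inst[30])
LW_X5_0_X6 = 0x00032283      # lw x5, 0(x6)

for word in (ADD_X1_X2_X3, SUB_X1_X2_X3, LW_X5_0_X6):
    print(f"{word:#010x} -> key {control_key(word):09b}")
```

For I-type instructions such as addi, inst[30] belongs to the immediate, so a controller built on this key has to treat that bit as a don't-care for those opcodes.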

#### Control Block Design



15 data bits (outputs)

#### **Control Realization Options**

#### ROM

- "Read-Only Memory"
- Regular structure
- Can be easily reprogrammed
  - fix errors
  - add instructions
- Popular when designing control logic manually
#### Combinatorial Logic

- Today, chip designers use logic synthesis tools to convert truth tables to networks of gates

## **ROM Controller Implementation**



Controller output (PCSel, ImmSel, ...)


### **Instruction Timing**



| IF     | ID       | EX     | MEM    | WB     | Total  |
|--------|----------|--------|--------|--------|--------|
| I-MEM  | Reg Read | ALU    | D-MEM  | Reg W  |        |
| 200 ps | 100 ps   | 200 ps | 200 ps | 100 ps | 800 ps |

#### **Instruction Timing**

| Instr | IF = 200ps | ID = 100ps | ALU = 200ps | MEM=200ps | WB = 100ps | Total |
|-------|------------|------------|-------------|-----------|------------|-------|
| add   | X          | X          | X           |           | X          | 600ps |
| beq   | X          | X          | X           |           |            | 500ps |
| jal   | X          | X          | X           |           | X          | 600ps |
| lw    | X          | X          | X           | X         | X          | 800ps |
| sw    | X          | X          | X           | X         |            | 700ps |
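
The totals in the table are just the sum of the delays of the blocks each instruction uses, and the slowest instruction sets the clock period. A minimal Python sketch of that arithmetic follows; the dictionary and function names are illustrative, and only add, beq, lw and sw are modeled.

```python
# Stage delays in picoseconds, taken from the instruction timing table.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Blocks used by each instruction (subset of the table).
USES = {
    "add": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
    "lw":  ["IF", "ID", "EX", "MEM", "WB"],
    "sw":  ["IF", "ID", "EX", "MEM"],
}


def instruction_time_ps(instr):
    """Sum the delays of the blocks this instruction passes through."""
    return sum(STAGE_PS[stage] for stage in USES[instr])


def max_clock_ghz():
    """Single-cycle clock rate: limited by the slowest instruction."""
    critical_ps = max(instruction_time_ps(i) for i in USES)
    return 1000.0 / critical_ps      # 1 / (t ps) equals (1000 / t) GHz


for name in USES:
    print(name, instruction_time_ps(name), "ps")
print("f_max =", max_clock_ghz(), "GHz")
```

In a single-cycle design every instruction gets the full 800 ps period, even though only lw needs all of it.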

#### Maximum clock frequency

- $f_{max} = 1/800\,\text{ps} = 1.25\,\text{GHz}$

#### Most blocks idle most of the time

- E.g. $f_{max,ALU} = 1/200\,\text{ps} = 5\,\text{GHz}$!
- How can we keep ALU busy all the time?
- 5 billion adds/sec, rather than just 1.25 billion?
- Idea: factories use three employee shifts, so equipment is always busy!


#### Performance Measures

- "Our" RISC-V executes instructions at 1.25 GHz
  - 1 instruction every 800 ps
- Can we improve its performance?
  - What do we mean by this statement?
  - Not so obvious:
    - Quicker response time, so one job finishes faster?
    - More jobs per unit time (e.g. web server returning pages)?
    - Longer battery life?



### Transportation Analogy



|                    | Sports Car | Bus    |
|--------------------|------------|--------|
| Passenger Capacity | 2          | 50     |
| Travel Speed       | 200 mph    | 50 mph |
| Gas Mileage        | 5 mpg      | 2 mpg  |

#### 50 Mile trip:

|                         | Sports Car | Bus         |
|-------------------------|------------|-------------|
| Travel Time             | 15 min     | 60 min      |
| Time for 100 passengers | 750 min    | 120 min     |
| Gallons per passenger   | 5 gallons  | 0.5 gallons |
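
The numbers in the 50-mile table follow directly from capacity, speed and mileage. The sketch below reproduces them; `trip_stats` is a hypothetical helper, and it assumes (as the table's figures imply) that empty return trips are not counted.

```python
import math


def trip_stats(capacity, speed_mph, mpg, distance_miles=50, passengers=100):
    """Return (trip minutes, minutes for all passengers, gallons per passenger)."""
    trip_min = distance_miles / speed_mph * 60
    trips = math.ceil(passengers / capacity)
    total_min = trips * trip_min          # assumption: return trips are ignored
    gallons_per_passenger = distance_miles / mpg / capacity   # full vehicle
    return trip_min, total_min, gallons_per_passenger


print("Sports car:", trip_stats(capacity=2, speed_mph=200, mpg=5))
print("Bus:       ", trip_stats(capacity=50, speed_mph=50, mpg=2))
```

The sports car wins on latency (one trip), while the bus wins on throughput and on fuel per passenger.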

## **Computer Analogy**

| Transportation          | Computer                                                                                              |
|-------------------------|-------------------------------------------------------------------------------------------------------|
| Trip Time               | Program execution time: e.g. time to update display                                                   |
| Time for 100 passengers | Throughput: e.g. number of server requests handled per hour                                           |
| Gallons per passenger   | Energy per task*: e.g. how many movies you can watch per battery charge or energy bill for datacenter |

\* Note: power is not a good measure, since a low-power CPU might run for a long time to complete one task, consuming more energy than a faster computer running at higher power for a shorter time.

#### "Iron Law" of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

#### Instructions per Program

#### Determined by

- Task
- Algorithm, e.g. O(N²) vs O(N)
- Programming language
- Compiler
- Instruction Set Architecture (ISA)

### (Average) Clock cycles per Instruction

#### Determined by

- ISA (CISC versus RISC)
- Processor implementation (or microarchitecture)
- E.g. for "our" single-cycle RISC-V design, CPI = 1
- Superscalar processors, CPI < 1 (next lecture)

# Time per Cycle (1/Frequency)

#### Determined by

- Processor microarchitecture (determines critical path through logic gates)
- Technology (e.g. 14nm versus 28nm)
- Power budget (lower voltages reduce transistor speed)


### Speed Tradeoff Example

- For some task (e.g. image compression) ...

|                | Processor A | Processor B |
|----------------|-------------|-------------|
| # Instructions | 1 Million   | 1.5 Million |
| Average CPI    | 2.5         | 1           |
| Clock rate f   | 2.5 GHz     | 2 GHz       |
| Execution time | 1 ms        | 0.75 ms     |

Processor B is faster for this task, despite executing more instructions and having a lower clock rate!
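
The execution times in the table come from multiplying instruction count by CPI and dividing by the clock rate. A short Python check of that arithmetic, with a function name chosen only for this example:

```python
def execution_time_s(instructions, cpi, clock_hz):
    """Time/program = instructions x cycles/instruction x seconds/cycle."""
    return instructions * cpi / clock_hz


time_a = execution_time_s(1.0e6, 2.5, 2.5e9)   # Processor A
time_b = execution_time_s(1.5e6, 1.0, 2.0e9)   # Processor B

print(f"A: {time_a * 1e3:.2f} ms, B: {time_b * 1e3:.2f} ms")
print("B is faster" if time_b < time_a else "A is faster")
```

No single factor decides the outcome: B's lower CPI outweighs both its larger instruction count and its slower clock.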

#### **Energy per Task**

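$$\frac{\text{Energy}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Energy}}{\text{Instruction}}$$

$$\frac{\text{Energy}}{\text{Program}} \propto \frac{\text{Instructions}}{\text{Program}} \times C V^2$$

- $C$: "capacitance", depends on technology, microarchitecture, circuit details
- $V$: supply voltage, e.g. 1 V

Want to reduce capacitance and voltage to reduce energy/task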

### **Energy Tradeoff Example**

"Next-generation" processor (Moore's law)

- Capacitance, C: reduced by 15%
- Supply voltage, V<sub>sup</sub>: reduced by 15%
- Energy consumption: $(0.85C)(0.85V)^2 \approx 0.61E$, i.e. a reduction of about 39%

- Significantly improved energy efficiency thanks to
  - Moore's Law AND
  - Reduced supply voltage
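
Because energy per task scales with $C V^2$, the combined effect of the two 15% reductions is the product $0.85 \times 0.85^2$. The sketch below evaluates it; `relative_energy` is a name invented for this example.

```python
def relative_energy(c_scale, v_scale):
    """Energy per task relative to the old design, assuming E is proportional to C * V^2."""
    return c_scale * v_scale ** 2


ratio = relative_energy(0.85, 0.85)
print(f"new energy = {ratio:.3f} x old energy")
print(f"reduction  = {(1 - ratio) * 100:.1f} %")
```

The voltage term contributes twice because it is squared, so the same percentage cut in V saves more energy than the same cut in C.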

### Energy "Iron Law"

Performance (Tasks/Second) = Power (Joules/Second) × Energy Efficiency (Tasks/Joule)

- Energy efficiency (e.g., instructions/Joule) is key metric in all computing devices
- For power-constrained systems (e.g., 20MW datacenter), need better energy efficiency to get more performance at same power
- For energy-constrained systems (e.g., 1W phone), need better energy efficiency to prolong battery life

### **End of Scaling**

- In recent years, industry has not been able to reduce supply voltage much, as reducing it further would increase "leakage power", where transistor switches don't fully turn off (more like a dimmer switch than an on-off switch)
- Also, the size of transistors, and hence capacitance, is not shrinking as much as before between transistor generations
- Power becomes a growing concern: the "power wall"
- Cost-effective air-cooled chip limit is around 150 W


### **Processor Trends**



### Break!



10/5/17

### Administrivia

- Project 3.1 released Tuesday night, due next Wednesday (3/7).
- Homework 2 due Friday
- Project party on Wednesday 7-10
- Guerrilla session Thursday 7-9pm
- Midterm 2, March 20, is moved to 8-10PM (was 7-9 on the website)
  - Alternative exam earlier, 6-8PM (so people don't need to be in exams until midnight :)
  - Submit the exam conflict form if you haven't already



## **Pipelining**

- A familiar example:
  - Getting a university degree



Year 1 → Year 2 → Year 3 → Year 4

- Shortage of computer scientists (your startup is growing):
  - How long does it take to educate 16,000 students?

### Computer Scientist Education

Option 1: serial



Option 2: pipelining



## Latency versus Throughput

#### Latency

- Time from entering college to graduation
  - Serial: 4 years
  - Pipelining: 4 years

#### Throughput

- Average number of students graduating each year
  - Serial: 1000
  - Pipelining: 4000

#### Pipelining

- Increases throughput (4x in this example)
- But does nothing to latency
  - sometimes worse (additional overhead e.g. for shift transition)
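
The education example can be worked through numerically. The sketch below assumes a cohort of 4000 students per class, which is inferred from the throughput figures (1000 graduates per year over a 4-year serial program) rather than stated on the slide; the function names are likewise illustrative.

```python
def years_to_educate(students, cohort=4000, program_years=4, pipelined=False):
    """Years until the last of `students` graduates."""
    cohorts = -(-students // cohort)          # ceiling division
    if pipelined:
        # A new cohort starts every year; the last one starts in year cohorts - 1.
        return program_years + (cohorts - 1)
    # Serial: the next cohort starts only after the previous one graduates.
    return cohorts * program_years


def graduates_per_year(cohort=4000, program_years=4, pipelined=False):
    """Steady-state throughput."""
    return cohort if pipelined else cohort / program_years


print("serial:   ", years_to_educate(16000), "years,",
      graduates_per_year(), "graduates/year")
print("pipelined:", years_to_educate(16000, pipelined=True), "years,",
      graduates_per_year(pipelined=True), "graduates/year")
```

Each student still spends four years in college in both cases; only the rate at which students finish changes.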

### Simultaneous versus Sequential

- What happens sequentially?
- What happens simultaneously?




# Pipelining with RISC-V



add t0, t1, t2

or t3, t4, t5

sll t6, t0, t3



instruction sequences

# Pipelining with RISC-V

add t0, t1, t2

or t3, t4, t5

sll t6, t0, t3



|                                     | Single Cycle                   | Pipelining             |
|-------------------------------------|--------------------------------|------------------------|
| Timing                              | $t_{step}$ = 100 to 200 ps     | $t_{cycle}$ = 200 ps   |
|                                     | Register access only 100 ps    | All cycles same length |
| Instruction time, $t_{instruction}$ | = $t_{cycle}$ = 800 ps         | 1000 ps                |
| Clock rate, $f_s$                   | 1/800 ps = 1.25 GHz            | 1/200 ps = 5 GHz       |
| Relative speed                      | 1 x                            | 4 x                    |
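
To see where the 4x figure comes from, the total time for a sequence of instructions can be compared under both schemes. This sketch assumes an ideal five-stage pipeline (IF, ID, EX, MEM, WB) with no stalls or hazards, which is a simplification; the function names are made up for the example.

```python
def single_cycle_time_ps(n, t_cycle_ps=800):
    """Each instruction takes one long cycle."""
    return n * t_cycle_ps


def pipelined_time_ps(n, stages=5, t_cycle_ps=200):
    """The first instruction needs `stages` cycles; each later one finishes
    one cycle after its predecessor (ideal pipeline, no stalls assumed)."""
    return (stages + n - 1) * t_cycle_ps


for n in (1, 3, 1_000_000):
    single = single_cycle_time_ps(n)
    piped = pipelined_time_ps(n)
    print(f"n={n}: single-cycle {single} ps, pipelined {piped} ps, "
          f"speedup {single / piped:.3f}x")
```

For a single instruction the pipeline is slower (1000 ps versus 800 ps); the speedup approaches 4x only as the instruction count grows.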

### Sequential vs Simultaneous

What happens sequentially, what happens simultaneously?




### And in Conclusion, ...

- Controller
  - Tells universal datapath how to execute each instruction
- Instruction timing
  - Set by instruction complexity, architecture, technology
  - Pipelining increases clock frequency, "instructions per second"
    - But does not reduce time to complete instruction
- Performance measures
  - Different measures depending on objective
    - Response time
    - Jobs / second
    - Energy per task